Measuring the Algorithmic Convergence of Random Forests via Bootstrap Extrapolation
Author
Abstract
When making predictions with a voting rule, a basic question arises: “What is the smallest number of votes needed to make a good prediction?” In the context of ensemble classifiers, such as Random Forests or Bagging, this question represents a tradeoff between computational cost and statistical performance. Namely, paying a larger computational price for more classifiers tends to improve the ensemble's prediction error and make it more stable. Conversely, using fewer classifiers and tolerating some variability in accuracy makes it possible to speed up both training the ensemble and making new predictions. In this paper, we propose a bootstrap method to quantify this tradeoff for the methods of Bagging and Random Forests. To be specific, suppose the training dataset is fixed, and let the random variable Err_t denote the prediction error of a randomly generated ensemble of t = 1, 2, ... classifiers. (The randomness of Err_t comes only from the algorithmic randomness of the ensemble.) Working under a “first order model” of Random Forests, we prove that the centered law of Err_t can be consistently estimated via our proposed method as t → ∞. As a consequence, this result offers practitioners a guideline for choosing the smallest number of base classifiers needed to ensure that the algorithmic fluctuations are negligible, e.g., var(Err_t) falling below a given threshold.
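To make the procedure concrete, here is a minimal sketch in Python with scikit-learn of the idea the abstract describes: train one large pool of trees on the fixed dataset, then bootstrap-resample trees from that pool to approximate the algorithmic fluctuations of Err_t. The dataset, the pool size, and the helper err_t_draw are illustrative assumptions of ours, not the paper's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fixed train/test split: all randomness below is algorithmic, i.e. it
# comes only from which trees end up in a size-t ensemble.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = X[:1000], X[1000:], y[:1000], y[1000:]

# Train one large pool of base trees once.
pool = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
# Per-tree class votes on the test set, shape (T, n_test).
votes = np.stack([tree.predict(X_test) for tree in pool.estimators_])

rng = np.random.default_rng(0)

def err_t_draw(t):
    """One bootstrap draw of Err_t: majority vote of t trees resampled from the pool."""
    idx = rng.integers(0, len(votes), size=t)            # t trees, with replacement
    majority = (votes[idx].mean(axis=0) > 0.5).astype(int)
    return np.mean(majority != y_test)

# Choose the smallest t whose algorithmic fluctuation var(Err_t) is acceptable.
for t in (25, 50, 100, 200):
    draws = np.array([err_t_draw(t) for _ in range(200)])
    print(f"t={t:4d}  mean={draws.mean():.4f}  var={draws.var():.2e}")
```

In practice, one would stop at the smallest t for which the estimated variance drops below the chosen tolerance.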
Similar Resources
Random Forests for Big Data
Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive amounts of data, but it also often includes data streams and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods, and bootstrapping schemes. Based o...
Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests
This work develops formal statistical inference procedures for predictions generated by supervised learning ensembles. Ensemble methods based on bootstrapping, such as bagging and random forests, have improved the predictive accuracy of individual trees, but fail to provide a framework in which distributional results can be easily determined. Instead of aggregating full bootstrap samples, we co...
Richardson Extrapolation and the Bootstrap
Simulation methods, in particular Efron's (1979) bootstrap, are being applied more and more widely in statistical inference. Given data (X_1, ..., X_n) distributed according to P belonging to a hypothesized model 𝒫, the basic goal is to estimate the distribution L_P of a function T_n(X_1, ..., X_n, P). The bootstrap presupposes the existence of an estimate P̂(X_1, ..., X_n) and consists of estimating L_P by the ...
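As a minimal textbook sketch of the bootstrap this entry refers to, with the illustrative choice T_n = sqrt(n)(X̄_n − mean(P)): the empirical distribution P̂ stands in for P, and resampling from it approximates L_P. The Richardson-style extrapolation the paper combines with this is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(size=100)      # observed sample X_1, ..., X_n from an unknown P
n = len(x)

# The bootstrap: P̂ is the empirical distribution of the sample; the law L_P of
# T_n = sqrt(n) * (mean(X) - mean(P)) is approximated by resampling from P̂.
B = 5000
boot = np.array([
    np.sqrt(n) * (rng.choice(x, size=n, replace=True).mean() - x.mean())
    for _ in range(B)
])
print("approx. 2.5%/97.5% quantiles of L_P:", np.quantile(boot, [0.025, 0.975]))
```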
The stability of feature selection and class prediction from ensemble tree classifiers
The bootstrap aggregating procedure at the core of ensemble tree classifiers reduces, in most cases, the variance of such models while offering good generalization capabilities. The average predictive performance of those ensembles is known to improve up to a certain point while increasing the ensemble size. The present work studies this convergence in contrast to the stability of the class pre...
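The convergence this entry studies can be observed directly with out-of-bag error, in the spirit of scikit-learn's own OOB-error example; the dataset and the grid of ensemble sizes below are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# warm_start=True keeps previously fitted trees, so each call only adds new
# ones; the out-of-bag error typically flattens once the ensemble is large enough.
rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
for t in (25, 50, 100, 200, 400):
    rf.set_params(n_estimators=t)
    rf.fit(X, y)
    print(f"n_estimators={t:4d}  oob_error={1 - rf.oob_score_:.4f}")
```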
Impact of Measuring Devices and Data Analysis on the Determination of Gas Membrane Properties
The time-lag method, using a gas permeation experiment, is currently the most popular method for determining the membrane properties: the diffusivity coefficient and the permeability coefficient, from which the solubility coefficient can be calculated. In this investigation, the impact of systematic, random (noise), resolution, and extrapolation errors associated with gas permeatio...